Problem and Goal

Abalone, a type of marine mollusk, are valued for their meat and shells, making accurate age determination crucial for sustainable fisheries management. Traditional methods of age determination, involving the manual counting of growth rings on shells, are labor-intensive and prone to error due to variability in ring deposition and shell wear. To address these challenges, this study aims to predict the age of abalone using non-destructive methods based on physical measurements.

The age of an abalone is conventionally estimated by counting growth rings on its shell and adding 1.5 years, as each ring represents approximately one year of growth. Leveraging this relationship, our goal is to develop a robust regression model that accurately predicts abalone age using measurable attributes such as length, diameter, height, and various weights (whole, shucked, viscera, and shell).
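The ring-to-age rule described above is simple enough to sketch in a few lines. The helper name rings_to_age is hypothetical and not part of the original analysis; the three ring counts are only example inputs.

```r
# Hypothetical helper (not from the original analysis): convert ring counts
# to an estimated age in years using the conventional "rings + 1.5" rule.
rings_to_age = function(rings) {
  rings + 1.5
}

example_rings = c(15, 7, 9)          # example ring counts
example_ages = rings_to_age(example_rings)
print(example_ages)                  # 16.5  8.5 10.5
```

Because the conversion only adds a constant, a model that predicts rings well predicts age equally well.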

By establishing a reliable predictive model, we aim to streamline age estimation processes in abalone fisheries management, contributing to sustainable harvesting practices and conservation efforts.

The data come from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/1/abalone

Load Data

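First, the data are loaded. The url variable holds the location of the data file, which read.csv() reads into abalone_data. The file has no header row, so the column names are supplied from the dataset's documentation. read.csv() replaces the spaces in these names with dots, so later code refers to Whole.weight, Rings.integer, and so on.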

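library(ggplot2)  # static plots
library(plotly)   # interactive version of the correlation heatmap

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
# header = FALSE because the file contains data rows only
abalone_data = read.csv(url, header = FALSE, col.names = c("Sex", "Length", "Diameter", "Height", "Whole weight", "Shucked weight", "Viscera weight", "Shell weight", "Rings integer"))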
head(abalone_data)
##   Sex Length Diameter Height Whole.weight Shucked.weight Viscera.weight
## 1   M  0.455    0.365  0.095       0.5140         0.2245         0.1010
## 2   M  0.350    0.265  0.090       0.2255         0.0995         0.0485
## 3   F  0.530    0.420  0.135       0.6770         0.2565         0.1415
## 4   M  0.440    0.365  0.125       0.5160         0.2155         0.1140
## 5   I  0.330    0.255  0.080       0.2050         0.0895         0.0395
## 6   I  0.425    0.300  0.095       0.3515         0.1410         0.0775
##   Shell.weight Rings.integer
## 1        0.150            15
## 2        0.070             7
## 3        0.210             9
## 4        0.155            10
## 5        0.055             7
## 6        0.120             8
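The dotted column names in the output above (Whole.weight, Rings.integer) differ from the names passed to col.names. A minimal sketch of why, assuming read.csv() is left at its default check.names = TRUE, which passes the names through make.names():

```r
# Names as supplied to col.names (a subset, for illustration)
raw_names = c("Whole weight", "Shucked weight", "Rings integer")

# make.names() turns them into syntactically valid R names;
# spaces become dots.
syntactic_names = make.names(raw_names)
print(syntactic_names)
```

This is why the later code uses Rings.integer rather than "Rings integer".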

Exploratory Data Analysis

The histogram below, drawn with ggplot() and geom_histogram(), shows how many abalone have each ring count (Rings.integer), using one bin per ring.

ggplot(abalone_data, aes(x = Rings.integer)) + 
  geom_histogram(binwidth = 1, fill = "blue", color = "black") + 
  labs(title = "Distribution of Rings (Age) in Abalones", x = "Rings", y = "Frequency")

Rings vs. Sex

ggplot() with geom_boxplot() draws a boxplot of Rings.integer grouped by Sex, which helps show whether the age distribution differs by sex. Sex takes three values: M (male), F (female), and I (infant).

ggplot(abalone_data, aes(x = Sex, y = Rings.integer, fill = Sex)) +
  geom_boxplot() +
  labs(title = "Boxplot of Rings by Sex", x = "Sex", y = "Rings")
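When reading the boxplot, it helps to know how outliers are chosen: by default, geom_boxplot() draws as individual points any values more than 1.5 × IQR beyond the edges of the box. The sketch below reproduces that rule by hand. The ring counts are made up for illustration and are not taken from the abalone data.

```r
# Hypothetical ring counts, for illustration only
hypothetical_rings = c(5, 7, 8, 9, 9, 10, 10, 11, 12, 25)

# Box edges are the first and third quartiles
quartiles = unname(quantile(hypothetical_rings, probs = c(0.25, 0.75)))
box_iqr = quartiles[2] - quartiles[1]

# Fences at 1.5 * IQR beyond the box edges
lower_fence = quartiles[1] - 1.5 * box_iqr
upper_fence = quartiles[2] + 1.5 * box_iqr

# Values outside the fences are drawn as outlier points
outliers = hypothetical_rings[hypothetical_rings < lower_fence |
                              hypothetical_rings > upper_fence]
print(outliers)
```

Here only the value 25 falls outside the fences, so it alone would appear as an outlier point.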

Correlation Matrix

cor() computes the correlation coefficients between the numeric variables. Plotting the correlation matrix as a heatmap (ggplot() with geom_tile()) gives a clear overview of the strength and direction of the correlations. I converted the plot with ggplotly() so that it is interactive and each tile shows its correlation value on hover.


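# Pairwise correlations for the eight numeric columns (Sex, column 1, is excluded)
cor_matrix = cor(abalone_data[, 2:9])
# as.table() + as.data.frame() reshape the matrix to long form: Var1, Var2, Freq
p = ggplot(as.data.frame(as.table(cor_matrix)), aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Correlation Matrix of Abalone Data", x = "", y = "")
# tooltip selects the values shown on hover
ggplotly(p, tooltip = c("Var1", "Var2", "Freq"))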
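The names Var1, Var2, and Freq used in the heatmap code come from reshaping the correlation matrix into long form. A small sketch of that reshaping step, using three columns of the built-in mtcars data purely as a stand-in (an assumption for illustration, unrelated to abalone):

```r
# Stand-in data: three numeric columns from the built-in mtcars dataset
small_cor = cor(mtcars[, c("mpg", "wt", "hp")])

# as.table() followed by as.data.frame() yields one row per matrix cell,
# with the row name in Var1, the column name in Var2, and the value in Freq.
long_cor = as.data.frame(as.table(small_cor))
print(long_cor)
```

A 3 x 3 matrix becomes 9 rows, one per tile of the heatmap, which is the shape geom_tile() expects.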

Regression Analysis

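lm() fits a linear regression model with Rings.integer as the dependent variable and every other column of the supplied data frame as a predictor (~ .). Note that the model is fitted on abalone_data[, -1], which drops the first column, so Sex is not a predictor even though it is converted to a factor with as.factor(). The seven predictors are the numeric measurements (Length, Diameter, Height, and the four weights), which matches numdf = 7 in the output. summary() provides detailed statistics for the fitted model.

abalone_data$Sex = as.factor(abalone_data$Sex)
# [, -1] drops Sex, leaving the seven numeric predictors plus Rings.integer
model = lm(Rings.integer ~ ., data = abalone_data[, -1])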
model_summary = summary(model)
f_stat = model_summary$fstatistic
significance = model_summary$coefficients
f_stat
##     value     numdf     dendf 
##  665.2439    7.0000 4169.0000
significance
##                  Estimate Std. Error     t value      Pr(>|t|)
## (Intercept)      2.985154  0.2691263  11.0920209  3.389437e-28
## Length          -1.571897  1.8247596  -0.8614271  3.890524e-01
## Diameter        13.360916  2.2370773   5.9724874  2.531444e-09
## Height          11.826072  1.5481284   7.6389479  2.699658e-14
## Whole.weight     9.247414  0.7326444  12.6219690  7.182198e-36
## Shucked.weight -20.213913  0.8233103 -24.5519999 1.921647e-124
## Viscera.weight  -9.829675  1.3040047  -7.5380678  5.817733e-14
## Shell.weight     8.576242  1.1367360   7.5446209  5.536306e-14
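The F-statistic output above also determines the model's R-squared, which is not printed in this section. For a linear model with an intercept, R² = F·k / (F·k + d), where k is numdf and d is dendf. A minimal sketch using the values copied from the output (the variable names are illustrative):

```r
# Values copied from the f_stat output above
f_value = 665.2439
numdf = 7       # number of predictors
dendf = 4169    # residual degrees of freedom

# R-squared implied by the F-statistic
r_squared = (f_value * numdf) / (f_value * numdf + dendf)

# Number of observations: predictors + residual df + 1 for the intercept
n_obs = numdf + dendf + 1

print(round(r_squared, 3))
print(n_obs)
```

The result is roughly 0.53, so the seven measurements explain about half of the variance in ring counts.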

Actual vs Predicted Plot

We need to assess how well the model predicts the number of rings (age) of an abalone. To do so, we compare the model's predictions with the actual values of Rings.integer. Called without new data, predict() returns the fitted values for the rows used to fit the model.

abalone_data$predicted = predict(model)
ggplot(abalone_data, aes(x = Rings.integer, y = predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "blue", linetype = "dashed") +
  labs(title = "Actual vs Predicted Rings", x = "Actual Rings", y = "Predicted Rings")
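To make concrete what predict() computes for each point in this plot, the sketch below applies the fitted coefficients by hand to a single abalone. The coefficients are copied from the regression output and the measurements from the first row of head(abalone_data); the variable names are illustrative.

```r
# Coefficients copied from the regression output
coefs = c(
  "(Intercept)"  =   2.985154,
  Length         =  -1.571897,
  Diameter       =  13.360916,
  Height         =  11.826072,
  Whole.weight   =   9.247414,
  Shucked.weight = -20.213913,
  Viscera.weight =  -9.829675,
  Shell.weight   =   8.576242
)

# First row of the data; the leading 1 multiplies the intercept
first_row = c(1, 0.455, 0.365, 0.095, 0.5140, 0.2245, 0.1010, 0.150)

# A linear prediction is the sum of coefficient * value
predicted_rings = sum(coefs * first_row)
actual_rings = 15
residual = actual_rings - predicted_rings

print(round(predicted_rings, 2))
print(round(residual, 2))
```

For this abalone the model predicts about 8.8 rings against an actual count of 15, an example of a point that falls well below the dashed y = x line.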

Analysis and Conclusion

Age: The distribution of rings (the proxy for age) is slightly right-skewed. One possible explanation is that abalone grow faster in their early years and more slowly as they age.

Rings vs Sex: Infants clearly have fewer rings than adult males and females: the infant IQR lies lower than the male and female IQRs. The male and female medians appear to be about the same, and their IQRs span similar values. One difference is that all of the female outliers lie above the IQR, at high ring counts.

Correlation Matrix: I examine only the top row, which shows the relationships between “Rings integer” and the other variables. The Freq legend shows low correlations in white and high correlations in red. “Shell weight” vs “Rings integer” stands out with the strongest correlation, 0.62. The other variables have moderate correlations with “Rings integer”, near 0.50.

Regression Analysis:
- All predictors except Length have highly significant coefficients (p < 0.001, marked *** in the summary() printout): Diameter, Height, Whole weight, Shucked weight, Viscera weight, and Shell weight. Length is not significant (p ≈ 0.39).
- The F-statistic tests the overall significance of the model. A large F-statistic (665.2) with a very low p-value (< 2.2e-16) indicates that the model as a whole is highly significant.

Actual vs Predicted: In an ideal plot, every point would lie on the y = x line. The points follow the line for younger abalone, but the spread widens for older abalone (more rings), where predictions are less reliable. Because predict() was applied to the same data used to fit the model, this is an in-sample comparison. Overall, the model predicts abalone age fairly well from physical measurements.

Steps for Further Analysis: To deepen the analysis of the abalone dataset, we can examine the relationship between Shell weight and Rings integer with scatter plots and correlation analysis. We can also investigate how Shell weight, Shucked weight, and Whole weight jointly predict Rings integer in a multiple regression model.